Text File | 2000-12-09 | 19KB | 468 lines

Amiga SED
An Amiga Stream Editor
© THOR-Software (Thomas Richter)
______________________________________________________________________________

Purpose of this program:

SED takes an input file, checks each line of this file against a pattern
supplied on the command line, and generates a new line from this pattern
match in the destination. This could either mean that the matching line is
removed completely from the output, replaced by a different line, or changed
according to the specifications of SED.

SED is an approximation of the Unix "stream editor" sed. It is not quite as
powerful as sed because its command set is currently very limited, and it
does not support command files. Its pattern syntax is different, too. It
still looks like "line noise" to me - I didn't want to break with this
tradition - but it's at least the Amiga kind of line noise. Its pattern
matching rules are a superset of the AmigaOs patterns, with some additional
features like "captured expressions" and more powerful "character classes"
and "escaping".

SED is useful for automatic processing of text files, e.g. the modification
of the startup-sequence. SED can also be run as a "filter", in which case it
reads its input from stdin and prints output to stdout. Combine this feature
with pipes and you get a very powerful text processing tool.

A warning: Pattern matching looks simple, but is full of hard-to-grasp traps.
This tool is therefore intended for "expert usage". In case you think SED
doesn't process your pattern correctly, think twice!
______________________________________________________________________________

SYNOPSIS:

SED FROM,TO,MATCH/A,REPLACE,CHANGE,DELETE/S,USECASE/S,ALL/S,VERBOSE/S

FROM    An (AmigaOs) pattern specifying the input file(s) to process.
        If not given, SED reads from the standard input.

TO      The output file to write the processed lines to. If not given,
        SED writes to the standard output.

MATCH   A pattern specification used for filtering the input lines.
        More on the pattern rules below.

The next options specify what to do with the lines matching the pattern:

REPLACE Replace matching lines in the input by this replace rule; do not
        write non-matching lines to the output. The replacement rules are
        given below.

CHANGE  Replace matching lines by the replace rule given by this
        expression. In contrast to REPLACE, non-matching lines are placed
        in the destination without change. Useful for modifying a file
        according to a pattern rule.

DELETE  Print all lines except those matching the pattern. This
        effectively removes the matching lines from the input file.

USECASE Be case-sensitive. By default, SED is case-insensitive. Note that
        SED differs in this detail from the Un*x sed.

ALL     In case the FROM pattern is a wildcard, enter sub-directories
        recursively.

VERBOSE Print information about the file currently scanned, and upon
        entering a directory. Also prints file name and line number
        information for found matches. By default, SED is quiet. Note
        that "Search" is by default not quiet.
______________________________________________________________________________

Pattern specification:

In the following, the syntax of the patterns is specified. By good
tradition, it is incomprehensible. First comes a "quick and dirty"
presentation of the available patterns, as a quick reference which might
give you an impression of the possibilities. It is anything but sufficient
to work with SED. Then, a detailed and more precise, but also more
confusing presentation follows.
______________________________________________________________________________

Quick guide to patterns:

SED patterns work much like Amiga patterns. Unlike in "Search", a pattern is
applied to a FULL line, and not to sub-strings of this line. This means that
the pattern "hello" matches ONLY the line consisting of the single word
"hello", nothing more. If you want to match lines containing "hello", use
"#?hello#?" instead - see below for what "#?" means.
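
The full-line rule above can be illustrated with a small Python analogue.
This is only a sketch under stated assumptions: it uses Python's "re"
module, not SED's own matcher, and the helper name matches_full_line is
made up for this illustration. The SED pattern "hello" is written here as
the regular expression "hello", and "#?hello#?" as ".*hello.*".

```python
import re

def matches_full_line(regex: str, line: str, use_case: bool = False) -> bool:
    """Hypothetical helper: True if `regex` matches the WHOLE line.

    re.fullmatch stands in for SED's rule that a pattern is applied to
    the full line. `use_case` mimics the USECASE switch; the default is
    case-insensitive, as in SED.
    """
    flags = 0 if use_case else re.IGNORECASE
    return re.fullmatch(regex, line, flags) is not None

# The analogue of the SED pattern "hello" matches only the line "hello".
print(matches_full_line("hello", "hello"))
print(matches_full_line("hello", "say hello"))

# The analogue of "#?hello#?" matches any line containing "hello".
print(matches_full_line(".*hello.*", "say hello"))
```

The first and third calls report a match, the second does not, because
"say hello" is longer than the pattern allows.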

Unlike Un*x sed, there are no special characters to match the start or the
end of a line. They are not required in the SED approach.

Standard patterns:

?       Matches a single arbitrary character.

#       Matches zero or more repetitions of the following symbol in the
        AmigaOs sense. Note that # may match zero(!) characters as well.
        Therefore, #? matches an arbitrary sequence of at least zero
        characters, hence any string.

+       Matches one or more repetitions of the following symbol.
        New to AmigaOs, standard in Un*x regular expressions.

*       Matches zero or more arbitrary characters in the sense of MS-DOS.
        Note that this is a functional difference to Un*x regular
        expressions, where * has the meaning of #.
        Note that you need to write ** instead of * if you enclose the
        pattern in double quotes on the shell command line. This is
        because * is also the BCPL escape character. Messy.

(...)   Groups the characters in the bracket into a single symbol. For
        example, #(ab) would match an arbitrary repetition of "ab", such
        as the empty string, "ab", "abab" or "ababab", but not "aba".
        Brackets can be nested.

(..|..) The vertical bar means "or". Matches either the left or the right
        string. The bar is only valid within brackets.

{...}   Groups expressions much like (...) but captures the contents of
        the sub-string that matched the brackets. This captured
        expression is then available for the output replacement rules,
        see below for more information. For example,

        SED MATCH {#?}.c REPLACE {1}.o

        would match all lines ending in ".c", and would capture the
        string in front of the ".c". The "{1}" in the replace pattern
        would insert this string, and would append an ".o". That is, the
        above replaces all lines ending in ".c" by a similar line ending
        in ".o". Works very much like the Un*x $..$ matching.

{..|..} The vertical bar works exactly the same way here as described
        above. Matches either the left or the right expression, and
        captures the expression that fits.

%       Matches the empty string.
        Useful for patterns like "#?(.c|.o|%)" which could be used to
        match the source, the object code and the final executable of a
        C project, for example.

~       Means "not" and matches all symbols that do not match the
        following symbol. Be warned, ~ is full of traps, see below for
        the full description.

[..]    Character classes. Matches a single character from a set of valid
        characters specified in the interior of the bracket. For example,
        "[ac]" would match the single character "a" or "c".

[..|..] Matches either the left or the right character range. Hence,
        [a|c] is equivalent to [ac].

[..,..] Another equivalent formulation of the above. [a,c] is the same as
        [ac] or [a|c].

[..-..] Matches a character range. [a-z] matches all letters - except
        language specific "Umlaute", though, which have different
        encodings. Several ranges can be grouped much the same way as
        single characters. [a-z|0-9] means "any letter or any digit" but
        nothing else.

[-..]   Matches all characters up to the specified character. Hence, [-z]
        means "all characters up to z". Note that unlike in Un*x
        implementations, there are no messy rules concerning the "["
        itself as a character. The escape character "\" must be used to
        specify "[" or "]" itself, see below. This syntax can be combined
        freely with "|" or "," to specify more than one range.

[..-]   Matches all characters starting at the given ASCII value. Can be
        combined freely with "," and "|". There are no messy rules
        concerning "-" in the middle or at the end of a character range;
        proper escaping must be used if "]" should be matched. [a-]
        therefore matches all characters "a" and up.

[~..]   Matches all characters not in the following range. ~ is applied
        up to the next "|" or ",", unlike in the standard AmigaOs (Arp)
        expression matching. Therefore, [~ab] matches all characters
        except "a" and "b" and is equivalent to [~a,~b] and [~a|~b].

\       Escape character.
        Specifies a character to be matched:

        \t      Tabulator
        \v      Vertical TAB
        \b      Backspace
        \r      CR
        \f      Form Feed
        \a      Bell
        \n      is INVALID since the end of the line is matched by the
                end of the pattern itself.
        \x..    The character encoded by the hex value following the "x".
                In case this specification is ambiguous, the number may
                be terminated by a dot ".". Hence, "\x9.0" matches a
                tabulator sign and the digit "0", whereas "\x90" matches
                the ASCII character of the code hex 90. Note that this
                rule differs from the ANSI-C rule.
        \0..    The character of the ASCII code encoded as an octal
                number. The dot is used as above as separator, unlike in
                ANSI-C.
        \$..    Identical to \x.., matches the character encoded by the
                ASCII code in hex.
        \d      Matches the dollar sign since \$ has a different meaning
                already.
        \#..    Matches the character encoded by the ASCII code in
                decimal notation.
        \h      Matches the hash-mark since \# has a different meaning
                already.

        Everything else: The character following the backslash itself.
        In particular, \\ is the backslash itself and \" is the double
        quote. Note that you must use the backslash to match characters
        which are otherwise part of the pattern syntax, for example "\("
        to match the bracket. Note that "#" and "$" are special in this
        sense since "\$" and "\#" are used to specify characters by
        ASCII code.

!,",§,$,&,=
-,^,',`,<,>
        are reserved for future use AND MUST NOT be used at all. Escape
        them if you need them. However, the dot (".") is free, unlike in
        Un*x regexps; the same goes for "@" and "/".

..      Everything else: Matches the character itself. Hence "a" matches
        a single "a" much like "[a]".
______________________________________________________________________________

Replacement rules:

The arguments of REPLACE and CHANGE specify what to do with the lines which
matched the specified pattern. Unlike in the pattern specification, only the
special operators \ and {..} are allowed. All other operators from the above
list are forbidden and generate an error.

\       Escape character, works identically to the \ in the pattern and
        places the single character encoded by the sequence following
        the backslash on the output directly.

{..}    Specifies a captured expression to be inserted into the output
        stream. The brackets take up to three arguments, separated by
        dots ".": the counting number of the captured expression, and
        optionally two arguments specifying how to format it. These two
        numbers work very much the same way as the arguments to the %s
        format specifier in ANSI-C.

        The first number in the bracket describes which captured
        expression to insert. If it is a positive number, the number is
        simply the index of the captured expression, counting from one
        upwards. Each opening bracket "{" in the input pattern starts a
        new captured expression, hence in nested expressions the
        outermost bracket has the lowest index.
        If this number is negative, it counts the captured expressions
        downwards from the last expression. If the specified expression
        does not exist, the brackets expand into an empty string that is
        formatted according to the rules given by the next two
        arguments.

        {1}     is the first captured expression,
        {3}     is the third expression,
        {-1}    is the last expression,
        {-2}    is the second to last expression.

        The next number is the field width to print the captured
        expression in. At least the specified number of characters are
        printed, or more if the expression is longer. If the expression
        is too short, the field is padded with blank spaces. The
        expression is right-justified in this field, unless the field
        width is negative, in which case the expression is
        left-justified. The sign of the field width is otherwise
        ignored. Defaults to 0, i.e. the field is always as small as
        possible.

        The last number is the size limit of the expression. The
        expression will be cut down if it is longer than the specified
        limit. SED will cut the end of the string if this argument is
        positive, or the start of the string if it is negative.
        The sign of the limit is otherwise ignored. If the limit is 0,
        which is the default, the expression will not be cut down at
        all.

        {1.10}  is the first captured expression, right-justified in a
                field of ten characters or longer.
        {2.-5.7}
                is the second captured expression, left-justified in a
                field of five characters. At most seven characters of
                the expression will be printed.

..      Everything else: The character itself is printed on the output.
______________________________________________________________________________

Detailed pattern matching rules:

And now for the detailed rules to confuse you completely:

- A SYMBOL is either a single character, one of the following operators
  followed by its arguments, a character class [..] or a (..) or {..}
  sequence.

- A PATTERN is a sequence of SYMBOLs.

- The POSTFIX of a symbol in a pattern is the subsequence of the pattern
  following the symbol, not including the argument of the symbol itself.
  A POSTFIX is always a PATTERN itself.

?       Matches a single character except the end of a string.

#       Matches as many repetitions of the following SYMBOL as possible,
        but at least zero, such that the POSTFIX of the symbol matches
        the remaining input. Hence, "#" is greedy. There is currently no
        non-greedy form.

+       Matches as many repetitions of the following SYMBOL as possible,
        but at least one, such that the POSTFIX of the SYMBOL matches
        the remaining input. "+" is greedy.

*       is fully equivalent to "#?" and therefore greedy.

(...)   groups the PATTERN up to the next | or ) into a SYMBOL which
        matches if the contents of the brackets match.

(..|..) An or-combined SYMBOL matches if one of the PATTERNs in the
        bracket matches such that the POSTFIX matches the remaining
        input.

{...},{..|..}
        Similar to the above except that the matched string is captured.

%       Is completely ignored as a pattern and gobbles no character from
        the input sequence at all.

~       Matches the longest subsequence of at least zero characters that
        does not match the following SYMBOL such that the POSTFIX of the
        SYMBOL still matches the remaining input. "~" is greedy and will
        try to match as many characters as possible first. Note that a
        SYMBOL could either be a single character or a sequence of
        characters grouped by () or #.
        Since a single character cannot match a string larger or smaller
        than one character, ~ followed by a one-character symbol will
        match all subsequences except those whose postfix either doesn't
        match the postfix of the character, or which match the character
        and the postfix.
        This is *very* tricky and you should think about the
        consequences of this rule twice. More examples below.

[..]    Character classes. Groups a range of characters into a SYMBOL
        that matches exactly a single character, but never the empty
        string.
        ~ in character classes is special: If there is a not-sequence in
        a character class, the class matches if all not-sequences match
        at once and one or more of the ordinary sequences match. Hence
        [~p,~q,a-z] matches all letters except p and q.

..      Everything else matches exactly the one character that it
        represents. It will not match the empty string.
______________________________________________________________________________

Some examples of patterns to think of:

%
        Matches only empty lines in the input.

~%
        Matches only non-empty lines in the input.

#?.c
        Matches all lines ending in ".c".

The#?
        Matches all lines starting with "The".

#?Example#?
        Matches all lines containing the word "Example".

Example#?
        Matches all lines starting with the word "Example".

Example
        Matches all lines consisting of the single word "Example".

#?
        Matches all lines.

#?(.c|.o|%)
        Matches all lines (think about why!).

foo(.c|.o|%)
        Matches all lines consisting entirely of the word "foo", "foo.c"
        or "foo.o".

foo(.c|.o|)
        Just the same.

~(Example)
        Matches all lines except the line consisting of the single word
        "Example".

~(#?Example#?)
        Matches all lines that do not contain the word "Example".

~(ab)cd
        Matches all lines that do not start with "ab" and that end in
        "cd". In particular, this would match "bccd". It would also
        match the line "cd" since "ab" does not match the empty sequence
        in front of "cd". (Think about this!)

~#a.c
        Matches all lines ending in ".c" except those where the ".c" is
        prefixed by an arbitrary number of a's, including zero a's.
        Hence, it would match "bc.c" and even "ab.c", but not "a.c" or
        ".c", as the last consists of zero a's and one ".c". It would
        not match "bc.o". This is identical to ~(#a).c since # binds the
        following a.

~(#ab)#?
        Matches all lines except those starting with a possibly empty
        sequence of a's followed by a single b. Hence, it does not match
        aaabccc or bccc.

~(#[ ,\t];)#?
        Matches all lines except those starting with a possibly empty
        sequence of blanks or tabulators followed by a semicolon. Hence,
        for a shell script, this would match all non-comment lines.

~(#[ ,\t];)if#?
        This is a tricky one. Unlike what you might think, this does not
        match all non-comment lines starting with if. It also matches
        lines starting with a semicolon provided the string "if" is in
        the line and not directly behind the semicolon. Note that this
        is the intended behaviour. For example, it would match

        ;aifb

        The reason is simple: This is a string ";a" that does not match
        the symbol #[ ,\t]; followed by "ifb" which matches if#?.

        What you want here instead is

        #[ ,\t]if#?

        which matches all if-lines with additional, at least zero,
        blanks or tabs in front.

The above example shows again the tricky nature of pattern matching.

A real life example would be

sed from S:Startup-Sequence match "{#[ ,\t]}RunBack{#?}" change "{1}Launch{2}"

which would replace all invocations of "RunBack" in the Startup-Sequence by
similar invocations of "Launch".

Another example to think about as an exercise:

({}|{#[~;]+[ \t]}#[~; \t][/:]|{}#[~; \t][/:]|{#[~;]+[ \t]})FooBar{|[ \t;]#?}

Yes, this pattern is useful.
Consider again S:Startup-Sequence as input file and think about what this
could possibly do. Note that some expressions are captured.

(Hey, I said this would look like line noise!)
______________________________________________________________________________

Thomas Richter, December 2000
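
As a companion to the "Replacement rules" section, the formatting of a
captured expression by {index.width.limit} might be sketched in Python as
follows. This is not SED's code: the function name format_capture is made
up, and two details are assumptions rather than documented behaviour,
namely that the size limit is applied before the padding (as with %s in
ANSI-C) and that an index of 0 is treated like a non-existing expression.

```python
def format_capture(captures, index, width=0, limit=0):
    """Hypothetical sketch of the {index.width.limit} replacement rule."""
    # Positive indices count from one upwards, negative indices count
    # downwards from the last captured expression.
    if index > 0:
        pos = index - 1
    elif index < 0:
        pos = len(captures) + index
    else:
        pos = -1  # assumption: index 0 behaves like "does not exist"
    # A non-existing expression expands into an empty string.
    text = captures[pos] if 0 <= pos < len(captures) else ""

    # Size limit: a positive limit cuts the end of the string, a
    # negative limit cuts the start; 0 means no cutting at all.
    if limit > 0:
        text = text[:limit]
    elif limit < 0:
        text = text[limit:]

    # Field width: right-justified by default, left-justified if the
    # width is negative. Longer strings are never shortened here.
    if width >= 0:
        return text.rjust(width)
    return text.ljust(-width)

captures = ["foo", "barbazqux"]
print(repr(format_capture(captures, 1, 10)))     # analogue of {1.10}
print(repr(format_capture(captures, 2, -5, 7)))  # analogue of {2.-5.7}
print(repr(format_capture(captures, -1)))        # analogue of {-1}
```

With these sample captures, {1.10} pads "foo" on the left to ten
characters, and {2.-5.7} prints the first seven characters of the second
expression without padding, since they already exceed the field width.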